James O’Brien
21211036
Graduate Certificate in Artificial Intelligence
Full Analysis
Iris Identification
Introduction
Question: How can botanists identify three species of Iris in the field based on
measurements?
Iris virginica
Iris versicolor
Iris setosa
Dataset
Fisher, R. A. (1936). The use of multiple measurements in taxonomic problems. Annals of Eugenics, 7(2), 179-188.
Fisher (1936) used field measurements collected by Edgar Anderson to highlight the problem of discriminating between species.
The dataset holds 50 observations per species, each recording sepal width, sepal length, petal width, and petal length.
Sepal morphology is unusual in Iris: the sepals are larger and more ornate than the inner petals, which they usually
protect.
Cleaning
One outlier recorded a sepal length of 540 meters (for scale, Iris giganticaerulea is 30 cm long), and two values were
missing for sepal width and petal length within I. setosa.
The erroneous value and the missing values (assumed missing completely at random, MCAR) were imputed with the mean of
the values that were present for the same measurement.
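The cleaning step can be sketched as follows. This is a minimal illustration and not the code used in the analysis: the 30 cm plausibility ceiling, the function name, and the sample values are all assumptions made for the example.

```python
from statistics import mean

# Assumed plausibility ceiling in cm (hypothetical, chosen for illustration only).
MAX_PLAUSIBLE_CM = 30.0


def is_valid(value, max_plausible=MAX_PLAUSIBLE_CM):
    """A value is usable when it is present and physically plausible."""
    return value is not None and 0 < value <= max_plausible


def impute_with_mean(values, max_plausible=MAX_PLAUSIBLE_CM):
    """Replace missing (None) or implausible entries with the mean of the valid ones."""
    valid = [v for v in values if is_valid(v, max_plausible)]
    fill = round(mean(valid), 1)  # one decimal place, matching the measurement precision
    return [v if is_valid(v, max_plausible) else fill for v in values]


# Made-up sepal lengths: one missing entry and one impossible entry recorded as 540.
sepal_length = [5.1, 4.9, None, 5.0, 540.0]
cleaned = impute_with_mean(sepal_length)
print(cleaned)
```

Mean imputation is only reasonable here because so few values are affected; with more missing data it would shrink the variance of the imputed measurement.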
Exploratory Data Analysis
Boxplots show I. setosa clearly separated from the other two species on petal length and petal width. I. virginica and
I. versicolor are difficult to tell apart because their measurements overlap outside the interquartile ranges.
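The kind of overlap a boxplot reveals can be checked numerically. The sketch below compares the interquartile boxes and the full ranges of two samples; the petal lengths are made-up illustrative values, not observations from the dataset, and the helper names are hypothetical.

```python
from statistics import quantiles


def iqr_box(values):
    """Return (Q1, Q3), the edges of the box in a boxplot."""
    q1, _median, q3 = quantiles(values, n=4, method="inclusive")
    return q1, q3


def full_range(values):
    """Return (min, max) of the sample."""
    return min(values), max(values)


def overlaps(a, b):
    """True when two closed intervals (lo, hi) share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]


# Made-up petal lengths in cm for two species with neighbouring distributions.
versicolor_like = [4.0, 4.2, 4.5, 4.7, 5.0]
virginica_like = [4.8, 5.1, 5.5, 5.8, 6.0]

boxes_overlap = overlaps(iqr_box(versicolor_like), iqr_box(virginica_like))
ranges_overlap = overlaps(full_range(versicolor_like), full_range(virginica_like))
print(boxes_overlap, ranges_overlap)
```

In this toy case the boxes are separate while the full ranges overlap, which is the pattern described for I. versicolor and I. virginica.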
Models
Cluster analysis shows that if petal length is less than 2.5 cm and/or petal width is less than 0.7 cm, the species is
I. setosa. A small number of observations where I. versicolor and I. virginica intersect cannot be fully
differentiated.
The remaining binary problem of distinguishing I. versicolor from I. virginica was approached with logistic regression.
Petal length and petal width give the sharpest delineation, compared with the gradual overlap of the sepal
measurements, and petal width has the lower error. Thresholds are rounded to one decimal place.
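A single-measurement logistic regression yields a threshold at the point where the fitted probability equals 0.5. The sketch below shows one way to obtain such a threshold with plain gradient descent. It is an assumption about how a cut-off could be derived, not the procedure used in the analysis, and the petal widths are made-up values.

```python
import math


def fit_logistic_1d(xs, ys, lr=0.5, steps=5000):
    """Fit p(y=1 | x) = sigmoid(w * (x - x_mean) + b) by batch gradient descent."""
    x_mean = sum(xs) / len(xs)
    centred = [x - x_mean for x in xs]  # centring keeps the optimisation well behaved
    w, b = 0.0, 0.0
    for _ in range(steps):
        grad_w = grad_b = 0.0
        for xc, y in zip(centred, ys):
            p = 1.0 / (1.0 + math.exp(-(w * xc + b)))
            grad_w += (p - y) * xc
            grad_b += p - y
        w -= lr * grad_w / len(xs)
        b -= lr * grad_b / len(xs)
    return w, b, x_mean


def decision_threshold(w, b, x_mean):
    """Measurement at which the fitted probability equals 0.5."""
    return x_mean - b / w


# Made-up petal widths in cm; label 0 = I. versicolor, 1 = I. virginica.
widths = [1.0, 1.2, 1.3, 1.4, 1.5, 1.8, 1.9, 2.0, 2.2, 2.3]
labels = [0, 0, 0, 0, 0, 1, 1, 1, 1, 1]

w, b, x_mean = fit_logistic_1d(widths, labels)
raw_threshold = decision_threshold(w, b, x_mean)
print(round(raw_threshold, 1))  # rounded to one decimal place for field use
```

Rounding the threshold to one decimal place mirrors the precision a botanist could realistically measure in the field.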
Results
Derived plant key from logistic regression:
If petal length is less than 2.5 cm: I. setosa
If petal width is less than 1.6 cm and petal length is less than 4.9 cm: I. versicolor
If petal width is greater than 1.6 cm and petal length is greater than 4.9 cm: I. virginica
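The key can be written as a small function. This sketch assumes the smaller-petalled branch is I. versicolor and the larger-petalled branch is I. virginica, which is what the counts reported for this key imply. The function name is hypothetical, and plants that match no rule are returned as `None` rather than forced into a class.

```python
def identify(petal_length_cm, petal_width_cm):
    """Apply the three-step key; return a species name, or None when no rule matches."""
    if petal_length_cm < 2.5:
        return "I. setosa"
    if petal_width_cm < 1.6 and petal_length_cm < 4.9:
        return "I. versicolor"
    if petal_width_cm > 1.6 and petal_length_cm > 4.9:
        return "I. virginica"
    return None  # left unidentified by morphology alone


# Illustrative measurements (petal length, petal width) in cm.
for length, width in [(1.4, 0.2), (4.5, 1.3), (5.8, 2.2), (4.9, 1.5)]:
    print(length, width, identify(length, width))
```

Strict inequalities are used deliberately: a plant measuring exactly 1.6 cm or 4.9 cm falls through to `None` instead of being assigned to either species.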
Using this key identifies all 50 I. setosa; 43 I. virginica, plus one I. versicolor wrongly keyed as I. virginica; and
43 I. versicolor. The remaining 13 plants are unidentifiable by morphology alone, and one plant is misidentified.
Decision tree methods lend themselves naturally to plant classification, but they categorize every plant, which
increases the error, and different algorithms suggest different measurements.
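The counts reported for the key can be reconciled with a short tally. The figures are those stated in the text; the variable names are illustrative.

```python
# Plants per species and number of species in the dataset.
PER_SPECIES = 50
total_plants = 3 * PER_SPECIES

# Number of plants the key assigns to each species, as reported.
keyed = {
    "I. setosa": 50,
    "I. virginica": 43 + 1,  # 43 correct plus one I. versicolor keyed as I. virginica
    "I. versicolor": 43,
}

identified = sum(keyed.values())
errors = 1
correct = identified - errors
unidentified = total_plants - identified

print(f"identified: {identified}, correct: {correct}, unidentified: {unidentified}")
print(f"share of keyed plants that are correct: {correct / identified:.3f}")
```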
Conclusion
Using “less than or equal to” increases the error, and an exact measurement to the precision of 16 mm would not be
available in the field or standardized across botanists. Identifying a smaller number of plants correctly may be more
beneficial than identifying all of them with errors.
Attempts to classify all plants based on morphology, where possible hybridization (Anderson, 1936) is occurring,
will lead to misclassification.
Decision trees can produce different results and different dimension criteria (cm), arrived at within a “black box”
that may be difficult to explain.
Explaining results to other botanists or taxonomists may be easier with simpler methods and with the acceptance that
some plants are difficult (or impossible) to identify by morphology alone.
Anderson, E. (1936). The species problem in Iris. Annals of the Missouri Botanical Garden, 23(3), 457-509.